Abstract

Latent Dirichlet Allocation models discrete 
data as a mixture of discrete distribu- 
tions, using Dirichlet beliefs over the mixture 
weights. We study a variation of this con- 
cept, in which the documents' mixture weight 
beliefs are replaced with squashed Gaussian 
distributions. This allows documents to be 
associated with elements of a Hilbert space, 
admitting kernel topic models (KTM), mod- 
elling temporal, spatial, hierarchical, social 
and other structure between documents. The 
main challenge is efficient approximate infer- 
ence on the latent Gaussian. We present an 
approximate algorithm cast around a Laplace 
approximation in a transformed basis. The 
KTM can also be interpreted as a type of 
Gaussian process latent variable model, or as 
a topic model conditional on document fea- 
tures, uncovering links between earlier work 
in these areas. 



1 Introduction 

Latent Dirichlet Allocation (LDA) [Blei et al., 2003] 
is a generative model for datasets comprising collec- 
tions of discrete samples. Each collection is assumed to 
be generated from a mixture of discrete distributions, 
such that both the belief over the discrete distribu- 
tions and over the mixture weights are Dirichlet. Text 
documents constitute the most popular domain with 
this anatomy: Each document in a corpus, treated as a 
"bag of words" (i.e. ignoring word order), is one collec- 
tion of (discrete) words, and the mixture components 
are interpreted as topics. Thus, each document ex- 
hibits several topics to varying degree, with each word 
in the document sampled from one specific topic. 

Real documents do not exist void of context. They are 
products of their authors, time, and place. Electronic 



communication has intensified this truism, and online 
corpora are now invariably accompanied by copious 
amounts of meta-data. The identity of the author may 
be augmented by additional knowledge about their lo- 
cation in a social graph, autobiographic information, 
and many more. Such features convey semantic in- 
formation: Topic popularity varies between West and 
East, conservatives and progressives, rich and poor, 
scientists and celebrities, young and old, contempo- 
raries and forebears. 

In its standard form, LDA cannot take advantage of
such metadata, but extensions proposed by several
authors have addressed certain types of meta-structure.
Dynamic development of topics over sequential sets of
documents was considered by Blei and Lafferty [2006],
Wang and McCallum [2006] and Wang et al. [2009].
Both Mimno and McCallum [2008] and Zhu and Xing 
[2010] considered a more general description of topics 
in terms of a linear function in a latent real vector 
space, linked to the topic dimension through the
softmax function. These works differ in their details:
some assume the topics stay constant over time while
their distribution changes, others that the topics
themselves change; words may be assumed to generate
features, or the other way round. They are linked,
however, by their common use of Gaussian random
variables to describe dynamics or regress on document
features. They also all use maximum likelihood or
maximum a-posteriori inference to fit regression
weights where they exist.

This work generalizes these approaches by replacing 
real-valued features with elements of a Hilbert space, 
and point estimates with Gaussian process measures 
(Figure 1). The resulting kernel topic model provides 
an expressive framework for the inclusion of virtually 
all types of metadata in the semantic description of 
topical data, and allows a rich description of nonlinear 
topic dynamics. The main mathematical challenge is
that inference on the latent Gaussian belief is not
analytically tractable. We address this through a
numerically lightweight Laplace approximation for
Dirichlet distributions in the softmax basis, extending
a note by MacKay [1998]. As a side effect, this approximation

Figure 1: Dimensionality-reduction view of topic models.
Top: LDA describes D documents containing words from a
vocabulary of size V in terms of K topics. Middle:
Dirichlet multinomial regression portrays the documents
in terms of F features, which generate the topics
through a linear map. Bottom: The kernel topic model
replaces the features with coordinates of a Hilbert
space $\mathcal{H}$, and the linear map with a nonlinear
one. The curly brace denotes a softmax projection from
$\mathbb{R}^K$ to the $[0,1]^K$ simplex.



also admits a particularly efficient implementation of 
Bayesian inference on linear latent models, such as the 
one introduced by Mimno and McCallum [2008]. The 
kernel topic model links topic modelling and Gaussian 
process latent variable models, effectively casting LDA 
as a likelihood for generalised Gaussian process
models. The price of the increased modelling flexibility
is a comparatively high computational cost: cubic in the
number of documents.



2 Methods 



2.1 Model 

We consider a corpus of $D$ documents. Document $d$
contains $I_d$ words $w_{di} \in \{1, \dots, V\}$,
$d \in \{1, \dots, D\}$, $i \in \{1, \dots, I_d\}$ from a
vocabulary of size $V$. Additional aspects of $d$ are
described by features $\phi_d \in \mathcal{H}$ in a
Hilbert space $\mathcal{H}$. In other words, the dataset
consists of pairs
$(w_d, \phi_d) \in \{1, \dots, V\}^{I_d} \times \mathcal{H}$.

We construct a topic model conditional on the observable
features of the documents, using the following
generative process for the vector $w_d$ from $K$ topics:



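• For each topic $k \in \{1, \dots, K\}$, generate a discrete
probability distribution with parameters
$\theta_k \in [0,1]^V$ over the vocabulary of size $V$ by
sampling from a Dirichlet distribution with parameter
vector $\beta_k$ ($\Gamma$ denotes the Gamma function):

$$p(\theta_k \mid \beta_k) = \mathcal{D}(\theta_k; \beta_k) = \frac{\Gamma\left(\sum_v \beta_{kv}\right)}{\prod_v \Gamma(\beta_{kv})} \prod_v \theta_{kv}^{\beta_{kv}-1} \qquad (1)$$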



• Independently sample $K$ functions
$h_k(\phi) : \mathcal{H} \to \mathbb{R}$ from the Hilbert
space of real-valued functions over $\mathcal{H}$, by
sampling from Gaussian process priors with mean functions
$\mu_k(\phi_d)$ and covariance functions
$\Sigma_k(\phi_d, \phi_{d'})$, induced by (potentially
topic-specific) kernels $\eta_k$:

$$p(h_k \mid \mu_k, \Sigma_k) = \mathcal{GP}(h_k; \mu_k, \Sigma_k) \qquad (2)$$

• For each document $d$ with features $\phi_d \in \mathcal{H}$:



- Draw a latent variable $y_d$ by evaluating
$h(\phi_d)$ and adding Gaussian noise of standard
deviation $\tau$:

$$p(y_d \mid h, \tau, \phi_d) = \prod_k \mathcal{N}(y_{dk}; h_k(\phi_d), \tau^2) \qquad (3)$$

- Define the topic proportions
$\pi_d = \sigma(y_d) \in [0,1]^K$, where $\sigma$ is the
softmax function

$$\sigma_k(y) = \frac{\exp(y_k)}{\sum_{\ell} \exp(y_\ell)} \qquad (4)$$



- For each of $I_d$ words

  * draw a topic $c_{di}$ from the discrete distribution
    defined by $\pi_d$:

$$p(c_{di} = k \mid \pi_d) = \pi_{dk} \qquad (5)$$

  * draw word $w_{di}$ from the discrete distribution
    of topic $c_{di}$:

$$p(w_{di} = v \mid c_{di}, \theta) = \theta_{c_{di} v} \qquad (6)$$
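
The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the features are taken to be scalars, the Gaussian process prior has zero mean with a squared-exponential kernel, and the function name `sample_corpus` and all default values are hypothetical.

```python
import numpy as np

def softmax(y):
    """Softmax of Eq. (4), shifted by the row maximum for numerical stability."""
    z = np.exp(y - y.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def sample_corpus(phi, n_words, K=3, V=20, beta=0.5, tau=0.1,
                  length_scale=1.0, seed=0):
    """Draw one synthetic corpus from the generative process of Section 2.1.

    phi     : (D,) scalar document features (illustrative assumption)
    n_words : (D,) integer document lengths I_d
    """
    rng = np.random.default_rng(seed)
    D = len(phi)
    # Eq. (1): topic-word distributions theta_k ~ Dirichlet(beta_k)
    theta = rng.dirichlet(np.full(V, beta), size=K)           # (K, V)
    # Eq. (2): h_k ~ GP(0, kernel); the kernel choice is an assumption
    diff = phi[:, None] - phi[None, :]
    gram = np.exp(-0.5 * (diff / length_scale) ** 2) + 1e-9 * np.eye(D)
    h = rng.multivariate_normal(np.zeros(D), gram, size=K).T  # (D, K)
    # Eq. (3): y_d = h(phi_d) plus Gaussian noise of standard deviation tau
    y = h + tau * rng.standard_normal((D, K))
    # Eq. (4): topic proportions on the simplex
    pi = softmax(y)
    docs = []
    for d in range(D):
        c = rng.choice(K, size=int(n_words[d]), p=pi[d])      # Eq. (5)
        w = np.array([rng.choice(V, p=theta[k]) for k in c])  # Eq. (6)
        docs.append(w)
    return docs, pi, theta
```

Calling `sample_corpus(np.linspace(0, 1, 5), [10] * 5)` returns five documents of ten word indices each, together with the topic proportions and topic-word distributions that generated them.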

The directed graphical model in Figure 2, left, sheds
light on the dependency structure of this generative
model. If we replace everything to the left of $\pi_d$ in
that figure by a single Dirichlet parameter vector
$\alpha$ (identical for all $d$), then the parts shown to
the right of and including the node $\pi$ correspond to
the traditional LDA model [Blei et al., 2003] (Figure 2,
right). On the other hand, we can identify the parts to
the left of (and excluding) $\pi$ as a case of Gaussian
process regression. It is the connection between these
two parts that makes the model challenging, and
approximate inference will in fact separate in this way.

In passing, we note a connection to the correlated topic 
model [Blei and Lafferty, 2007], which shares every- 
thing to the right of and including y in Figure 2, but 
not the regression element to its left. Instead, that 

Figure 2: Left: Directed graphical model of the kernel topic model. Some variables labeled for clarity. Right:
latent Dirichlet allocation. The models are identical to the right of and including $\pi$.



model focusses on estimating the correlation between 
topics, which here is replaced by a simpler, diagonal 
covariance. Introducing correlations between topics is 
possible in our model using an approach analogous to 
the cited work (maximum likelihood estimation on the 
covariance structure), but left out here for clarity. 

3 Inference

Ample expertise has accumulated in the literature on
inference for both LDA and Gaussian processes given
(approximately) Gaussian likelihoods. What is missing
is a connection between the two paradigms. This link is
the main contribution of this paper. To clarify the
setting, however, we give a very brief introduction to
the two sub-systems in this section, then derive the
link, the Laplace bridge, in Section 3.3.

3.1 Semi-Collapsed Variational Inference

Broadly speaking, there are two popular methods for
inference in LDA: variational inference [Blei et al.,
2003] and collapsed Gibbs sampling [Griffiths and
Steyvers, 2004]. Gibbs samples come from the exact
posterior, but provide no analytic form for the beliefs.
Since our extension benefits from such forms, we opt for
a variational approximation. Standard inference in LDA
[Blei et al., 2003, Blei and Lafferty, 2009, Hoffman et
al., 2010] uses a fully factorized approximate
distribution, but Teh et al. [2007] showed that this
Ansatz entails unnecessarily loose bounds and slow
convergence. To mitigate this problem, latent variables
should be integrated out wherever possible. Since we
require explicit forms for the per-document topic
distributions $\pi_d$, we cannot integrate out this
variable, but we can collapse the bound on the per-topic
distributions $\theta$. This amounts to an adaptation of
Teh et al.'s work, which we do not dwell on here for
brevity. The bottom line is that it is possible to
construct a variational bound that, given a Dirichlet
prior

$$p(\pi_d \mid \alpha_d) = \mathcal{D}(\pi_d; \alpha_d) \qquad (7)$$

on $\pi_d$, assigns approximate Dirichlet "posterior"
beliefs

$$p(\pi_d \mid \alpha_d, w_d) = \mathcal{D}(\pi_d; \alpha_d + \nu_d) \qquad (8)$$

with a vector $\nu_d \in \mathbb{R}^K$ of pseudo-counts.
At the LDA end of the divide between Gaussian regression
and LDA, we thus require a Dirichlet belief.



3.2 Gaussian Process Regression 

For the moment, assume there is some isomorphism
$\mathcal{L}$ between $K$-dimensional Dirichlet
distributions and $K$ approximately independent Gaussian
ones (to be developed in Section 3.3):

$$\mathcal{L}: \mathcal{D}(\pi_d; \alpha_d) \mapsto \prod_{k=1}^{K} \mathcal{N}(y_{dk}; \mu_{dk}, \sigma_{dk}^2) \qquad (9)$$

This transform provides approximate Gaussian messages
from $\pi_d$ to $y_d$ in the graph of Figure 2. With
these messages, Gaussian process inference over the
Hilbert space $\mathcal{H}$ becomes a known problem, and
we can implement an approximate Gaussian process latent
inference algorithm: For every topic $k$, the posterior
belief over the function $h_k(\phi_*)$ at the Hilbert
location $\phi_*$ is the product of the Gaussian process
prior and the $D$ approximately independent Gaussian
messages
$p(y_d \mid h(\phi_d), w, \theta) = \mathcal{N}(\mu_{dk}; h_k(\phi_d), \tau^2 + \sigma_{dk}^2)$.
We subsume the means of these messages into a vector
$\mu_k$ and their variances into a diagonal matrix
$\Sigma_k = \mathrm{diag}(\tau^2 + \sigma_{dk}^2)$, which
allows us to write the mean and marginal variance
functions of the posterior Gaussian process as

$$m[h_k(\phi_*)] = \eta_k(\phi_*, \Phi)\,(K + \Sigma_k)^{-1}\mu_k, \qquad v[h_k(\phi_*)] = \eta_k(\phi_*, \phi_*) - \eta_k(\phi_*, \Phi)\,(K + \Sigma_k)^{-1}\,\eta_k(\Phi, \phi_*) \qquad (10)$$

(here $\Phi$ collects the features of all documents, $K$
is the kernel Gram matrix over $\Phi$, and the mean is
written for a zero prior mean function). Writing the
message precisions (inverse variances) as
$\zeta_d = (\sigma_d^2 + \tau^2)^{-1}$, we construct a
matrix $S = \mathrm{diag}(\zeta)$ and
message-precision-adjusted means $\nu = S\mu$. Using this
notation, the implementation of iterative Gaussian
process inference from approximate Gaussian messages
contained in Section 3.6.3, in particular Algorithms 3.5
and 3.6, of Rasmussen and Williams [2006] can be used
almost without changes.
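
To make the regression step concrete, the following NumPy sketch evaluates a Gaussian process posterior from per-document Gaussian messages with individual variances. It assumes a zero prior mean function and a precomputed kernel matrix; the function name `gp_posterior` and its argument names are illustrative and do not come from the paper or from Rasmussen and Williams [2006].

```python
import numpy as np

def gp_posterior(gram, k_star, k_star_star, mu, noise_var):
    """Posterior mean and marginal variance of h_k at one test location.

    gram        : (D, D) kernel matrix over the training documents
    k_star      : (D,) kernel between the test location and the documents
    k_star_star : scalar prior variance at the test location
    mu          : (D,) means of the Gaussian messages for one topic
    noise_var   : (D,) message variances tau^2 + sigma_dk^2
    Assumes a zero prior mean function.
    """
    C = gram + np.diag(noise_var)
    L = np.linalg.cholesky(C)                          # C = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, mu))   # C^{-1} mu
    v = np.linalg.solve(L, k_star)                     # L^{-1} k_star
    mean = k_star @ a
    var = k_star_star - v @ v
    return mean, var
```

The Cholesky factorisation is the cubic-cost step; it is shared between the mean and the variance and can be reused for all test locations of the same topic.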

In Gaussian process generalised regression, the
hyperparameters (kernel parameters and observation
noise) are usually estimated by evidence maximisation
("type-II maximum likelihood"). In our case, the unknown
function is $h$, the data is $w$, and we denote the
hyperparameters by $\xi$. Evidence maximisation would
amount to optimising
$p(w \mid \xi) = \int p(w \mid h, \xi)\, p(h \mid \xi)\, dh$.
However, in our case, there is an approximate inference
algorithm separating the Gaussian process regression
from the observed data, so this kind of optimisation has
exceedingly high computational cost (each evaluation of
$p(w \mid \xi)$ involves running the LDA part of Section
3.1 to convergence). Instead, a much cheaper, if less
effective, method is to maximise $p(y \mid \xi)$, where
$y$ are the estimated per-document topic distributions.
Defining the matrix $B = I + S^{1/2} K S^{1/2}$, a simple
algebraic argument similar to the one in Rasmussen and
Williams [2006], Section 3.6.3, gives the log evidence



$$\log Z = \tfrac{1}{2}\left(\log|S| - \log|B| - \mu^\top S^{1/2} B^{-1} S^{1/2} \mu\right) + \text{const.} \qquad (11)$$

which is numerically stable (because all eigenvalues of 
B are larger than 1), and can be implemented effi- 
ciently. Derivatives of log Z with respect to the ker- 
nel parameters, required for efficient optimisation, are 
straightforward to calculate using linear algebra iden- 
tities. 
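
As a hedged illustration of why the form in terms of $B$ is convenient, the sketch below evaluates the Gaussian log evidence $\log \mathcal{N}(\mu; 0, K + S^{-1})$ through a Cholesky factor of $B = I + S^{1/2} K S^{1/2}$. It assumes a zero prior mean; the function name `log_evidence` and the explicit $2\pi$ constant are our additions, not taken from the paper.

```python
import numpy as np

def log_evidence(gram, mu, precision):
    """log N(mu; 0, K + S^{-1}), computed via B = I + S^{1/2} K S^{1/2}.

    gram      : (D, D) kernel matrix K
    mu        : (D,) message means for one topic
    precision : (D,) message precisions, the diagonal of S
    """
    D = len(mu)
    s_half = np.sqrt(precision)
    # B has all eigenvalues >= 1, so its Cholesky factor is well conditioned
    B = np.eye(D) + s_half[:, None] * gram * s_half[None, :]
    L = np.linalg.cholesky(B)
    b = np.linalg.solve(L, s_half * mu)          # L^{-1} S^{1/2} mu
    log_det_B = 2.0 * np.sum(np.log(np.diag(L)))
    quad = b @ b                                 # mu^T S^{1/2} B^{-1} S^{1/2} mu
    return (0.5 * (np.sum(np.log(precision)) - log_det_B - quad)
            - 0.5 * D * np.log(2.0 * np.pi))
```

Summing this quantity over the $K$ topics gives an objective in the kernel parameters that can be handed to a generic optimiser.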

3.3 The Laplace Bridge 

To link these two parts of the inference, we must
connect the Dirichlet belief on $\pi_d$ and the Gaussian
domain required for $y_d$. Since $\sigma(y_d) = \pi_d$,
this task amounts to an uncertain form of logistic
regression, in the sense that discrete samples $c_{dn}$
from the distribution defined by $\pi_d$ are replaced by
probabilistic beliefs over $c_{dn}$. Our solution to this
problem is to construct a Laplace approximation to
Dirichlet distributions in the softmax basis, in which
these distributions can be approximated by Gaussians
much better than in the popular simplex basis.

MacKay [1998] showed that, because the softmax function
has a Jacobian proportional to $\prod_k \pi_k$, a basis
change from probabilities $\pi$ to real numbers
$y = \sigma^{-1}(\pi)$ gives the Dirichlet a new
parametric form

$$\mathcal{D}_y(\pi(y); \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \pi_k(y)^{\alpha_k}\, g(\mathbf{1}^\top y) \qquad (12)$$

$g(\mathbf{1}^\top y)$ is an arbitrary normalisable
measure, required to ensure integrability by restricting
the sum of the elements of $y$ ($\mathbf{1}$ is the
vector $[1, 1, 1, \dots]$). In this basis, the Dirichlet
lacks the $-1$ terms in the exponents present in the
standard representation, and thus does not diverge for
$|y| \to \infty$ and $\alpha_k < 1$. It is also a
unimodal distribution whose mode at
$\pi(y) = \alpha / \|\alpha\|_1$ now falls together with
its mean. These aspects allow a good quality Laplace
approximation.

For numerical convenience, we choose (like MacKay)

$$g(\mathbf{1}^\top y) = \exp\left(-\tfrac{\epsilon}{2}\,(\mathbf{1}^\top y)^2\right). \qquad (13)$$



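MacKay shows that the Hessian of the negative logarithm
of this distribution has elements

$$L_{k\ell}(y) = -\frac{\partial^2 \log \mathcal{D}_y(\pi(y); \alpha)}{\partial y_k\, \partial y_\ell} = \hat{\alpha}\,(\delta_{k\ell}\pi_k - \pi_k \pi_\ell) + \epsilon\,(\mathbf{1}\mathbf{1}^\top)_{k\ell} \qquad (14)$$

(using Kronecker's $\delta$ and
$\hat{\alpha} = \sum_k \alpha_k$; the $\epsilon$ stems
from Eq. (13)). To construct a Laplace approximation of
the Dirichlet in the form of a multivariate Gaussian
$\mathcal{N}(y; \mu, \Sigma)$ (deviating from MacKay's
derivations from here on), we identify the mean $\mu$
with the mode of the distribution,

$$\mu_k = \log \alpha_k - \frac{1}{K} \sum_{\ell=1}^{K} \log \alpha_\ell \qquad (15)$$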



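and the covariance $\Sigma$ with the inverse of the
Hessian $L$ of its negative logarithm. To gain a sparse
approximation, we analytically invert the Hessian. To do
so, we introduce the rectangular matrix
$X \in \mathbb{R}^{K \times 2}$ with elements
$X_{ku} = \pi_k \delta_{1u} + 1_k \delta_{2u}$ and the
square matrices $A \in \mathbb{R}^{K \times K}$ and
$B \in \mathbb{R}^{2 \times 2}$

$$A = \mathrm{diag}(\alpha) \quad \text{and} \quad B = \begin{pmatrix} -\hat{\alpha} & 0 \\ 0 & \epsilon \end{pmatrix} \qquad (16)$$

which allows us to write $L = A + X B X^\top$ at the
mode, where $\hat{\alpha}\pi_k = \alpha_k$. Both $A$ and
$B$ are diagonal with nonzero diagonal elements, and thus
invertible. Hence we can use the matrix inversion lemma,
which exposes an analytically invertible $2 \times 2$
Schur complement and thus easily yields the inverse of
the Hessian

$$\Sigma_{k\ell} = (L^{-1})_{k\ell} = \frac{\delta_{k\ell}}{\alpha_k} - \frac{1}{K}\left(\frac{1}{\alpha_k} + \frac{1}{\alpha_\ell}\right) + \frac{1}{K^2}\left(\sum_{u=1}^{K} \frac{1}{\alpha_u} + \frac{1}{\epsilon}\right) \qquad (17)$$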

Because this inverse is defined for all positive values
of $\epsilon$, we can safely take the limit of
$\epsilon \to \infty$, i.e. $g(x) \to \delta(x)$, to the
Dirac point distribution. Note that the off-diagonal
elements of this matrix are suppressed with
$\mathcal{O}(1/K)$, so for large $K$, the belief is
approximately independent, with element-wise variances

$$\Sigma_{kk} = \frac{1}{\alpha_k}\left(1 - \frac{2}{K}\right) + \frac{1}{K^2} \sum_{\ell=1}^{K} \frac{1}{\alpha_\ell} \qquad (18)$$
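
The closed-form inverse can be checked numerically. The sketch below builds the Hessian at the mode as $\mathrm{diag}(\alpha) - \hat{\alpha}\pi\pi^\top + \epsilon \mathbf{1}\mathbf{1}^\top$ and multiplies it with a candidate inverse obtained from the matrix inversion lemma. The function names are hypothetical, and the closed form in `analytic_inverse` is our own derivation under these definitions, so it should be read as a consistency check rather than a transcription of the paper.

```python
import numpy as np

def hessian_at_mode(alpha, eps):
    """L = diag(alpha) - a_hat * pi pi^T + eps * 1 1^T, with pi = alpha / a_hat."""
    a_hat = alpha.sum()
    pi = alpha / a_hat
    K = len(alpha)
    return np.diag(alpha) - a_hat * np.outer(pi, pi) + eps * np.ones((K, K))

def analytic_inverse(alpha, eps=np.inf):
    """Closed-form inverse of the Hessian; eps = inf gives the Dirac limit."""
    K = len(alpha)
    inv = 1.0 / alpha
    const = (inv.sum() + 1.0 / eps) / K**2
    return np.diag(inv) - (inv[:, None] + inv[None, :]) / K + const

def marginal_variances(alpha):
    """Diagonal of the inverse Hessian in the limit eps -> infinity."""
    K = len(alpha)
    return (1.0 - 2.0 / K) / alpha + np.sum(1.0 / alpha) / K**2
```

For any positive `alpha` and finite `eps`, the product `hessian_at_mode(alpha, eps) @ analytic_inverse(alpha, eps)` should equal the identity up to rounding, and `marginal_variances(alpha)` should match the diagonal of `analytic_inverse(alpha)`.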

Figure 3: Laplace approximations between Gaussians and Dirichlets. Left: Simplex basis. Right: Softmax
basis. The parameter choices for the Beta distributions (special 1D case of the Dirichlet) are (a, b) =
(2, 1.2); (0.5, 0.9); (3, 4). Under the Laplace approximation, these correspond to one-dimensional Gaussian
parameters $(\mu, \sigma^2)$ = (0.5, 1.3); (-0.6, 3.1); (-0.3, 0.6). Note that the Laplace approximation matches modes (and
means) in the softmax basis (right), not the simplex basis (left).



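(This map is only valid for $K > 2$. In the 2D-case, a
special, much simpler solution can be derived by mapping
directly to the real line. See also Figure 3.) It is not
hard to invert this $\alpha \to (\mu, \Sigma)$ map from
Dirichlet to Gaussian parameters, giving

$$\alpha_k = \frac{1}{\Sigma_{kk}}\left(1 - \frac{2}{K} + \frac{e^{\mu_k}}{K^2} \sum_{\ell=1}^{K} e^{-\mu_\ell}\right) \qquad \forall k = 1, \dots, K \qquad (19)$$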

Figure 3 gives an intuition for the quality and defects of
this approximation in the 2D case. The approximation
is very good for large entries in $\alpha$, but retains
good quality even for $\alpha < 1$, which is important
for topic models, where the prior is often sparse.

While it has previously been investigated in MacKay 
[1998], the use of this approximation here differs con- 
siderably from the setting studied in the cited paper 
(which dealt with evidence estimation in neural net- 
works). Its use here amounts to the following: 

• Some unobserved process with known parameters
$\mu, \Sigma$ generates data as follows:

- Sample $x \in \mathbb{R}^K \sim \mathcal{N}(x; \mu, \Sigma)\, \mathcal{N}(0; \mathbf{1}^\top x, \epsilon^2)$

- Map $\pi = \sigma(x)$

- Sample data $c$ from $p(c = k \mid \pi) = \pi_k$

• The inference method tries to infer $x$ thus:

- Use the Laplace map to gain a Dirichlet belief
on $\pi$ from the Gaussian prior (15)

- Update this belief using the data (which is
trivial, due to the Dirichlet's conjugacy to the
Multinomial distribution)

- Use the Laplace map in the opposite direction, to
get a Gaussian belief on $\mathbb{R}^K$, and claim the
resulting belief to be an approximate posterior on $x$
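
The three inference steps can be sketched as follows. This is a minimal illustration that assumes the diagonal (approximately independent) form of the Gaussian belief and the closed-form maps between Dirichlet parameters and Gaussian means and variances; the function names are hypothetical and the formulas are our reading of the bridge, valid for $K > 2$.

```python
import numpy as np

def dirichlet_to_gaussian(alpha):
    """Softmax-basis Laplace approximation of a Dirichlet (diagonal form)."""
    K = len(alpha)
    log_a = np.log(alpha)
    mu = log_a - log_a.mean()                              # mode, sums to zero
    var = (1.0 - 2.0 / K) / alpha + np.sum(1.0 / alpha) / K**2
    return mu, var

def gaussian_to_dirichlet(mu, var):
    """Inverse map from Gaussian means and variances to Dirichlet parameters."""
    K = len(mu)
    return (1.0 - 2.0 / K + np.exp(mu) * np.sum(np.exp(-mu)) / K**2) / var

def laplace_bridge_update(mu, var, counts):
    """Gaussian prior -> Dirichlet -> conjugate update -> Gaussian posterior."""
    alpha = gaussian_to_dirichlet(mu, var)   # step 1: Dirichlet belief on pi
    alpha_post = alpha + counts              # step 2: add observed counts
    return dirichlet_to_gaussian(alpha_post) # step 3: back to the Gaussian
```

For example, `laplace_bridge_update(np.zeros(5), np.ones(5), np.array([10, 0, 0, 0, 0]))` shifts the posterior mean towards the first component and shrinks its variance, while an all-zero count vector returns the prior unchanged.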

Figure 4 compares this approximate scheme to an 
asymptotically exact Markov Chain Monte Carlo 
scheme (the particular MCMC method chosen for this 
task is elliptical slice sampling [Murray et al., 2010],
which has the advantage of having no free parame- 
ters). The figure shows the 2-norm error of a point 
estimate for x returned by the two methods (solid 
lines) and error estimates constructed from the algo- 
rithms' results. For the MCMC sampler, these two 
estimates are the sample mean and (unbiased) sam- 
ple covariance. For the Laplace approximations, the 
two estimates are the mean and standard deviation 
of the approximate Gaussian belief. The prior mean 
and covariance were sampled, for each experiment sep- 
arately, from the standard Gaussian and the standard 
inverse Wishart distribution, respectively. The number
of dimensions was set to K = 10. Note that the Laplace
bridge does not show any discernible bias or
over-convergence. Its only two apparent drawbacks are
its relatively bad fit for $\alpha \to 0$ and that
covariance cannot be captured by the Dirichlet.

Figure 4: Convergence behaviour of approximate inference
using the Laplace bridge compared to MCMC inference.
Solid lines represent deviation of mean estimate (sample
mean for MCMC) from ground truth, dashed lines the error
estimate of the inference algorithm (one standard
deviation). Both methods were initialised with a prior
of $\mu = 0$, $\sigma = 1$. Plots are averages over 12
independent experiments.

3.4 The Wider View 

Within the wider context of unsupervised learning 
methods, the kernel topic model establishes a connec- 
tion between conditional topic models and Gaussian 
process latent variable models (GPLVM) [Lawrence, 
2004]. GPLVMs learn mappings from data-space to 
a lower-dimensional space, assuming the generative 
model for the data in the latent space is a Gaussian 
process. In the case of the kernel topic model, the
variational part of the inference learns a mapping from
the $V$-dimensional space of documents defined by their
words to the $K$-dimensional space defined by their
topics (Figure 1), where the documents' topics are
assumed to be generated by a Gaussian process. However,
in GPLVMs the map between data and their low-dimensional
representation is usually assumed to be generated by
another Gaussian process. In the kernel topic model, the
lower dimensional distributions are discrete, and
sampled from Dirichlet distributions. The kernel topic
model thus performs Gaussian process regression, under a
"latent Dirichlet likelihood".



4 Experiments 

4.1 Euclidean and Discrete Spaces 

We compare the kernel topic model to its conceptually 
closest competitor, the Dirichlet-Multinomial Regres- 
sion (DMR) model by Mimno and McCallum [2008], 
which was, in the cited work, shown to give superior 
results to a number of other models, such as topics 
through time [Wang and McCallum, 2006] and the au- 
thor topic model [Rosen-Zvi et al., 2004]. The dataset 
consists of the annual State Of The Union addresses by 
US presidents to the joint chambers of Congress, an- 
notated with both the speaker's identity and the year 
of delivery. This dataset is interesting because it com- 
bines continuous features (time) with 44 discrete ones 
(author identity) and thus falls outside of the descrip- 
tive power of time drift models like the one by Wang 
et al. [2009]. All models used K = 10 topics. 

For the linear model of DMR, we represented time us- 
ing 100 radial basis functions spaced evenly through 
the time period from years 1790 to 2011, each with 
a width of 5 years, and used 44 binary author indi- 
cator features. For the kernel topic model, we used 
a rational quadratic kernel [Matérn, 1960, Rasmussen
and Williams, 2006] on the Hilbert space of time and
author identity, assigning a distance between documents
linear in time (initially using the same scale of
5 years), with an additional constant term if the
authors of two documents are not the same. The rational
quadratic kernel is equivalent to an infinite scale
mixture of square exponential kernels: it assigns
nonzero mass to functions with a range of length scales,
while the square exponential (for which the radial basis
functions of the linear model are a finite-dimensional
approximation) can only construct functions of a single
length scale. So the kernel model is strictly more
expressive than the linear model in this case. In
addition, the evidence maximisation description as
introduced in Section 3.2 allows an optimisation of the
kernel parameters during training. For DMR, this would
amount to optimising the feature set, rather than the 
feature parameters, which is more difficult to do effi- 
ciently. 
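
The scale-mixture property can be illustrated numerically. The sketch below defines a document distance of the kind described above and compares the rational quadratic kernel with a Monte Carlo average of square exponential kernels whose inverse squared length scale is Gamma distributed. The parameterisation, the function names, and the value of `author_penalty` are assumptions made for illustration; the paper does not state how the time and author terms are combined or which constants were used.

```python
import numpy as np

def document_distance(t1, a1, t2, a2, time_scale=5.0, author_penalty=1.0):
    """Distance linear in time plus a constant if the authors differ.
    `author_penalty` is a hypothetical value; the text does not give it."""
    return abs(t1 - t2) / time_scale + (author_penalty if a1 != a2 else 0.0)

def rational_quadratic(r, shape=1.0, length_scale=1.0):
    """k(r) = (1 + r^2 / (2 * shape * length_scale^2))^(-shape)."""
    return (1.0 + r**2 / (2.0 * shape * length_scale**2)) ** (-shape)

def squared_exponential(r, length_scale=1.0):
    """k(r) = exp(-r^2 / (2 * length_scale^2))."""
    return np.exp(-0.5 * (r / length_scale) ** 2)

def rq_by_scale_mixture(r, shape=1.0, length_scale=1.0, n=200_000, seed=0):
    """Average of SE kernels over Gamma-distributed inverse squared scales."""
    rng = np.random.default_rng(seed)
    inv_sq_scale = rng.gamma(shape, 1.0 / (shape * length_scale**2), size=n)
    return np.mean(np.exp(-0.5 * inv_sq_scale * r**2))
```

As `shape` grows, the Gamma distribution concentrates on a single length scale and `rational_quadratic` approaches `squared_exponential`, which is the single-scale behaviour attributed to the linear model's basis functions.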

Figure 5 shows the consequences of this additional 
expressive power: The kernel model captures inter- 
esting detail in the development of American interior 
and foreign policy, including long-term developments 
like the industrial revolution (bright red topic at bot- 
tom of plot) and faster developments like the Spanish- 
American war (light blue, top). 

Figure 6 shows the development of the perplexity score 
[Rosen-Zvi et al., 2004] of the two models, on the train- 
ing set, during training on three different datasets (see 

Figure 5: Inferred topic distribution of State Of The Union addresses by US American presidents. Top: kernel
topic model using a rational quadratic kernel on the 45 dimensional space of authors and time; Bottom: Linear
model using 100 radial basis functions in time and 44 binary author features. To generate this plot, either model
was used to predict the topic distributions at the given date, conditioned on the author being the president in
office at that time.



caption for details on datasets). (The vocabulary size
for this dataset is V = 5000, so the initial perplexity
is 5000.) Optimisation of kernel hyperparameters was
performed every tenth variational loop, and is visible
as discrete steps in the plots when it has
non-negligible effect, thus also giving an intuition for
the model performance without hyper-optimisation.
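
For reference, perplexity in this sense can be computed as the exponential of the negative mean per-word log-likelihood. The sketch below assumes point estimates of the topic proportions and topic-word distributions and marginalises the topic assignment of each word; the function names are illustrative, and the exact estimator used for Figure 6 may differ.

```python
import math

def word_probability(v, pi_d, theta):
    """p(w = v | pi_d, theta) = sum_k pi_dk * theta_kv."""
    return sum(p_k * theta_k[v] for p_k, theta_k in zip(pi_d, theta))

def perplexity(docs, pi, theta):
    """exp(-(total log-likelihood) / (total number of words)).

    docs  : list of documents, each a list of word indices
    pi    : list of per-document topic proportions
    theta : list of per-topic word distributions
    """
    log_lik, n_words = 0.0, 0
    for words, pi_d in zip(docs, pi):
        for v in words:
            log_lik += math.log(word_probability(v, pi_d, theta))
            n_words += 1
    return math.exp(-log_lik / n_words)
```

With uniform topic-word distributions over a vocabulary of size V, every word has probability 1/V and the perplexity equals V, which matches the initial value of 5000 quoted above.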

The kernel topic model converges about as fast as the
DMR, but achieves a final score about 12% below that
of DMR. The two methods' runtimes are roughly comparable
on our datasets: both models share the LDA part. In the
regression part, DMR requires numerical optimisation of
the feature weights, while the kernel topic model
requires inverting a large matrix.

4.2 Topics on Graphs 

The kernel view on topic models also allows a relatively 
elegant treatment of non-Euclidean feature spaces. As 
an example, we construct a topic model on a graph. 
For our experiment, D = 318 documents were taken
from Wikipedia's "list of probability topics". We
construct a positive definite kernel by embedding the
documents in the $\mathbb{R}^D$ Euclidean vector space,
setting

$$k(d_1, d_2) = s \exp\left(-\tfrac{1}{2}\,(x_1 - x_2)^\top S\, (x_1 - x_2)\right) \qquad (20)$$

where the vector elements $x_{di}$ are the shortest
distances, on the graph of links between documents, from
document $d$ to document $i$ (links are interpreted as
undirected edges; documents not linked by any path are
assigned infinite distance), and $s$ and
$S = \mathrm{diag}(s_i)$ are parameters. Of course it is
possible to define corresponding linear features, but
the kernel view arguably allows a more natural way of
deriving such measures.
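
A small sketch of this construction: each document is represented by its vector of shortest-path distances to all documents, and a squared-exponential kernel is evaluated on these vectors. The code assumes a connected, undirected graph given as adjacency lists (infinite distances would need separate treatment), and the function names and default parameters are illustrative only.

```python
from collections import deque
import numpy as np

def shortest_path_matrix(adjacency):
    """All-pairs hop distances by breadth-first search (undirected graph)."""
    n = len(adjacency)
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if np.isinf(dist[s, v]):        # first visit gives the shortest path
                    dist[s, v] = dist[s, u] + 1.0
                    queue.append(v)
    return dist

def graph_kernel(dist, s=1.0, S_diag=None):
    """k(a, b) = s * exp(-0.5 * (x_a - x_b)^T S (x_a - x_b)), x_a = dist[a]."""
    n = dist.shape[0]
    S_diag = np.ones(n) if S_diag is None else np.asarray(S_diag, dtype=float)
    K = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            diff = dist[a] - dist[b]
            K[a, b] = s * np.exp(-0.5 * np.sum(S_diag * diff**2))
    return K
```

For a path graph with three nodes, `graph_kernel(shortest_path_matrix([[1], [0, 2], [1]]))` yields a symmetric matrix with unit diagonal, and because the kernel is a squared exponential on an explicit embedding it is positive semi-definite.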

5 Conclusion 

We have presented the kernel topic model, allowing 
nonparametric regression of topics on document meta- 
data of various kinds. The model is a combination of 
Gaussian process regression and latent Dirichlet
allocation; these two conditionally independent parts are
linked efficiently through a lightweight Laplace ap- 
proximation. Inference in the kernel topic model is 
cubic in the number of documents. In large corpora, 
this can compare unfavourably to other feature-based 
topic models, but it offers superior power of expression 
for small and medium-sized corpora, where (approx- 
imate) analytic Gaussian process inference can even 
be faster than EM optimization of point estimates. 
An elegant side-effect of the Laplace approximation, 
which we have only touched upon marginally in this 
paper, is that it replaces the point estimates of earlier 
approaches with a full Bayesian belief. This means 
that topics can be predicted with uncertainty, and that 
hyperparameters of the model can be inferred consis- 
tently, using higher order maximum likelihood (maxi- 
mum evidence) optimisation. 

Figure 6: Perplexity of kernel topic model (blue) and
linear maximum-likelihood Gaussian model or a constant
LDA model (red). KTM hyperparameters were optimised
after every 10 iterations (the kernel regression itself
is updated after every document inference). Top: State
Of The Union dataset. Here, the hyperparameters happened
to be chosen well; optimising them had negligible effect
on perplexity. Middle: Wiki documents (Section 4.2).
Note the spike in the perplexity of the kernel model in
the latter plot, caused by the optimisation of
hyperparameters: since the optimisation is not performed
directly on the word level, the topic model crosses over
into a more perplexed state at this point, but this
subsequently allows a better representation. Bottom:
NIPS dataset [Globerson et al., 2007], again showing
considerable improvement after hyperparameter
optimisation.